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Abstract 

Background: Snake venoms have significant impacts on liuman populations through the morbidity and mortality 
associated with snakebites and as sources of drugs, drug leads, and physiological research tools. Genes expressed by 
venom-gland tissue, including those encoding toxic proteins, have therefore been sequenced but only with relatively 
sparse coverage resulting from the low-throughput sequencing approaches available. High-throughput approaches 
based on 454 pyrosequencing have recently been applied to the study of snake venoms to give the most complete 
characterizations to date of the genes expressed in active venom glands, but such approaches are costly and still 
provide a far-from-complete characterization of the genes expressed during venom production. 

Results: We describe the denovo assembly and analysis of the venom-gland transcriptome of an eastern 
diamondback rattlesnake {Crotalus adamanteus) based on 95,643,958 pairs of quality-filtered, 100-base-pair lllumina 
reads. We identified 1 23 unique, full-length toxin-coding sequences, which cluster into 78 groups with less than 1 % 
nucleotide divergence, and 2,879 unique, full-length nontoxin coding sequences. The toxin sequences accounted for 
35.4% of the total reads, and the nontoxin sequences for an additional 27.5%. The most highly expressed toxin was a 
small myotoxin related to crotamine, which accounted for 5.9% of the total reads. Snake-venom metalloproteinases 
accounted for the highest percentage of reads mapping to a toxin class (24.4%), followed by C-type lectins (22.2%) 
and serine proteinases (20.0%). The most diverse toxin classes were the C-type lectins (21 clusters), the snake-venom 
metalloproteinases (16 clusters), and the serine proteinases (14 clusters). The high-abundance nontoxin transcripts 
were predominantly those involved in protein folding and translation, consistent with the protein-secretory function 
of the tissue. 

Conclusions: We have provided the most complete characterization of the genes expressed in an active snake 
venom gland to date, producing insights into snakebite pathology and guidance for snakebite treatment for the 
largest rattlesnake species and arguably the most dangerous snake native to the United States of America, C. 
adamanteus. We have more than doubled the number of sequenced toxins for this species and created extensive 
genomic resources for snakes based entirely on de novo assembly of lllumina sequence data. 



Background 

Human envenomation by snakes is a worldwide issue 
that claims more than 100,000 lives per year and exacts 
untold costs in the form of pain, disfigurement, and loss 
of limbs or limb function [1-3]. Despite the significance 
of snakebites, their treatments have remained largely 
unchanged for decades. The only treatments currently 
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available are traditional antivenoms derived from antisera 
of animals, usually horses [4], innoculated with whole 
venoms [5,6]; such an approach is the only readily avail- 
able option for largely uncharacterized, complex mixtures 
of proteins such as snake venoms. Although often lifesav- 
ing and generally effective against systemic effects, these 
antivenoms have little or no effect on local hemorrhage 
or necrosis [7-9], which are major aspects of the pathol- 
ogy of viperid bites and can result in lifelong disability 
[4,5]. These traditional treatments also sometimes lead to 
adverse reactions in patients [6]. Advances in treatment 
approaches will depend on a complete knowledge of the 
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nature of the offending toxins, but current estimates of 
the numbers of unique toxins present in snake venoms 
are in excess of 100 [10], a number not approached in 
even the most extensive venom-characterization efforts 
to date [11]. 

The significance of snake venoms extends well beyond 
the selective pressures they may directly impose upon 
human populations. Snake venoms have evolutionary 
consequences for those species that snakes prey upon 
[12,13], as well as species that prey upon the snakes 
[14], and their study can therefore provide insights into 
predator-prey coevolution. Snake venom components 
have been leveraged as drugs and drug leads [15-17] and 
have been used directly as tools for studying physiologi- 
cal processes such as pain reception [18]. In addition to 
the significance of the toxins, the nature of the extreme 
specialization of snake venom glands for the rapid but 
temporary production and export of large quantities of 
protein could provide insights into basic mechanisms of 
proteostasis, the breakdown of which is thought to con- 
tribute to neurodegenerative diseases such as Parkinson's 
and Alzheimer's [19]. 

The eastern diamondback rattlesnake {Crotalus 
adamanteus) is a pit viper native to the southeastern 
United States and is the largest member of the genus 
CrotaluSy reaching lengths of up to 2.44 m [20]. The diet 
of C adamanteus consists primarily of small mammals 
(e.g., squirrels, rabbits, and mouse and rat species) and 
birds, particularly ground-nesting species such as quail 
[20]. Because of its extreme size and consequent large 
venom yield, C. adamanteus is arguably the most dan- 
gerous snake species in the United States and is one of 
the major sources of snakebite mortality throughout its 
range [21]. Crotalus adamanteus has recently become of 
interest from a conservation standpoint because of its 
declining range, which at one time included seven states 
along the southeastern Coastal Plain [22]. This species 
has now apparently been extirpated from Louisiana and 
is listed as endangered in North Carolina [23,24]. As a 
consequence of recent work by Rokyta et al. [11] based 
on 454 pyrosequencing, the venom of C adamanteus is 
among the best-characterized snake venoms; 40 toxins 
have been identified. 

Transcriptomic characterizations of venom glands 
of snakes [25-28] and other animals [29-32] have 
relied almost exclusively on low- throughput sequencing 
approaches. Sanger sequencing, with its relatively long, 
high-quality reads, has been the only method available 
until recently and has provided invaluable data on the 
identities of venom genes. Because venomous species 
are primarily nonmodel organisms, high-throughput 
sequencing approaches have been slow to pervade 
the field of venomics (but see Hu et al. [33]), despite 
becoming commonplace in other transcriptomic-based 



fields. Rokyta et al. [11] recently used 454 pyrosequenc- 
ing to characterize venom genes for C. adamanteus. More 
recently, Durban et al. [34] used 454 sequencing to study 
the venom-gland transcriptomes of a mix of RNA from 
eight species of Costa Rican snakes. Whittington et al. 
[30] used a hybrid approach with both 454 and lUumina 
sequencing to characterize the platypus venom-gland 
transcriptome, although they had a reference genome 
sequence, making de novo assembly unnecessary. Pyrose- 
quencing is expensive and low-throughput relative to 
Illumina sequencing, and the high error rate, particu- 
larly for homopolymer errors [35], significantly increases 
the difficulty of identifying coding sequences without 
reference sequences. 

We sequenced the venom-gland transcriptome of the 
eastern diamondback rattlesnake with Illumina technol- 
ogy using a paired-end approach coupled with short 
insert sizes effectively to produce longer, high-quality 
reads on the order of approximately 150 nt to facili- 
tate de novo assembly (an approach similar to that of 
Rodrigue et al. [36] for metagenomics). The difference in 
read length from that of 454 sequencing was compensated 
for by the increase of more than two orders of magni- 
tude in the number of reads. We demonstrated de novo 
assembly and analysis of a venom-gland transcriptome 
using only Illumina sequences and provided a compre- 
hensive characterization of both the toxin and nontoxin 
genes expressed in an actively producing snake venom 
gland. 

Results and discussion 

Venom-gland transcriptome sequencing and assembly 

We generated a total of 95,643,958 pairs of reads that 
passed the Illumina quality filter for > 19 gigabases (Gb) of 
sequence from a cDNA library with an average insert size 
of ->170 nt. Of these reads, 72,114,709 (75%) were merged 
(see Methods) on the basis of their 3' overlap (Figure 1), 
yielding composite reads of average length 142 nt with 
average phred qualities >40 and a total length >10 Gb. 
This merging of reads reduced the effective size of the data 
set without loss of information and provided long reads to 
facilitate accurate assembly. 

Our first approach to transcriptome assembly was 
aimed at identifying toxin genes. We attempted to use as 
many of the data as possible to ensure the identification 
of even the lowest- abundance toxins. To this end, we con- 
ducted extensive searches of assembly parameter space for 
both ABySS [37,38] (Table 1) and Velvet [39] on the basis 
of the full set of both merged and unmerged reads. We 
used the assemblies with the best NSQ values for further 
analysis. For Velvet, the assembly using a k-mev size of 91 
was best (A/^50 = 408); this assembly was subsequently 
analyzed with Oases [40]. For ABySS, the best k-mev value 
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Proposed fragment length Position in fragment 

Figure 1 Merging overlapping reads. (A) Reads are slid along each other until the number of matches exceeds the significance threshold. In the 
example shown, the optimal overlap is 74 nucleotides (nt). (B) The quality of reads declines dramatically toward their 3' ends, where overlap occurs 
if the fragment length is less than twice the read length, allowing the actual quality to be much higher than the nominal values. The example 
shown is the average of pairs that overlap by exactly 50 nt. 



Table 1 ABySS assembly summaries 
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Table 1 ABySS assembly summaries (Continued) 
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was also 91 {N50 = 2,007), but because the performance 
in terms of full-length transcripts appeared to depend 
strongly on the coverage (c) and erode (e) parameters, we 
further analyzed the k = 91 assemblies with c = 10 
and e = 2, c = 100 and e = 100, and c = 1000 and 
e = 1000. We identified all full-length toxins by means of 
blastx searches on the results of all four assemblies. 

As part of our first approach, we also performed 
four independent de novo transcriptome assemblies with 
NGen: three with 20 million merged reads each and one 
with the remaining 12,114,709 merged reads (Table 2). We 
identified all full-length toxins from all four assemblies. 
Given that all three assembly methods tended to generate 
a large number of fragmented toxin sequences, apparently 
because of retained introns and possibly alternative splic- 
ing, we developed and implemented a simple hash-table 



approach to completing partial transcripts, which we will 
refer to as Extender (see Methods). We used Extender on 
partial toxin sequences identified for two of the four NGen 
assemblies. We also annotated the most abundant full- 
length nontoxin transcripts for the three assemblies based 
on 20 million reads. After combining all of the anno- 
tated toxin and nontoxin sequences from the ABySS, Vel- 
vet, and NGen assemblies and eliminating duplicates, we 
had 72 unique toxin sequences and 234 unique nontoxin 
sequences. The paucity of full-length annotated nontox- 
ins reflects our focus on toxin sequences rather than their 
absence in the assemblies. 

Our second approach to transcriptome assembly 
was designed to annotate as many full-length coding 
sequences (toxin and nontoxin) as possible and to build 
a reference database of sequences to facilitate the future 



Table 2 NGen assembly summaries 
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analysis of other snake venom-gland transcriptomes. We 
found that NGen was much more successful at producing 
transcripts with full-length coding sequences but also 
that it was quite inefficient when the coverage distribu- 
tion was extremely uneven (see Figure 2). Feldmeyer et 
al [41] also found NGen to have the best assembly per- 
formance with Illumina data. We sought therefore first 
to eliminate the transcripts and corresponding reads for 
the extremely high-abundance sequences. To do so, we 



employed Extender as a de novo assembler by starting 
from 1,000 individual high-quality reads and attempting 
to complete their transcripts (see Methods). From 1,000 
seeds, we identified 318 full-length coding sequences with 
213 toxins and 105 nontoxins. After duplicates were elim- 
inated, this procedure resulted in 58 unique toxin and 44 
unique nontoxin full-length transcripts. These sequences 
were used to filter the corresponding reads from the full 
set of merged reads with NGen. We then performed a de 
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Figure 2 Domination of the C. ac/amanteus venom-gland transcriptome by toxin transcripts. The 123 unique toxin sequences were clustered 
into 78 groups witli less than 1% nucleotide divergence for estimation of abundances. (A) The vast majority of the extremely highly expressed 
genes were toxins. The inset shows a magnification of the top 200 transcripts. (B) Expression levels of individual toxin clusters are shown with toxin 
classes coded by color. The toxin clusters are in the same order as in Table 3. 
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novo transcriptome assembly on 10 million of the filtered 
reads with NGen, annotated full-length transcripts from 
contigs comprising > 200 reads with significant blastx 
hits, and used the resulting unique sequences as a new fil- 
ter. This process of assembly, annotation, and filtering was 
iterated two more times. The end result was 91 unique 
toxin and 2,851 unique nontoxin sequences. 

The results from both assembly approaches were 
merged to yield the final data set. The first approach 
produced 72 unique toxin and 234 unique nontoxin 
sequences, and the second 91 toxin and 2,851 non- 
toxin sequences. The merged data set consisted of 
123 unique toxin sequences and 2,879 nontoxins that 
together accounted for 62.9% of the sequencing reads 
(Figure 3). 



Toxin transcripts 

We identified 123 individual, unique toxin transcripts with 
full-length coding sequences. To estimate the abundances 
of these transcripts in the C. adamanteus venom-gland 
transcriptome, we clustered them into 78 groups with less 
than 1% nt divergence (Table 3). Clusters could include 
alleles, recent duplicates, or even sequencing errors, 
which are characteristic of high-throughput sequencing 
[42]. For longer genes, clusters might also include different 
combinations of variable sites that are widely separated in 
the sequence. We chose 1% as a practical, but arbitrary, 
cut-off for defining clusters. Mapping reads back to more 
similar sequences to estimate abundances would be prob- 
lematic because reads could not be uniquely assigned to 
a particular sequence. The true number of toxin genes 
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Figure 3 Expression levels of major classes of toxins and nontoxins. More than 60% of the total reads have been accounted for with full-length 
annotated transcripts. (A) The major toxin classes were the CTLs, SVSPs, MYO, and SVMPs (types II and III). (B) As expected for a protein-secreting 
tissue, the venom gland expresses an abundance of proteins involved in proteostasis. 
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Table 3 Expression levels of full-length toxin clusters 


Rank Cluster name Cluster size Length 
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Table 3 Expression levels of full-length toxin clusters (Continued) 
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20 SVMPIII-3 


3 2401 


6.00 X 


10-^ 


1.698 


a:JU173712, 












b:JU173713, 












cJUl 73714 


21 SVSP-9 


1 10270 


5.96 X 


10-^ 


1.685 


JUl 73738 


22 SVMPII-4 


1 2016 


5.41 X 


10-^ 


1.530 


JUl 73693 


23 SVSP-8 


1 3524 


5.41 X 


10-^ 


1.529 


JUl 73737 


24 CTL-7 


1 763 


5.40 X 


10-^ 


1.528 


JUl 73655 


25 CTL-18 


1 775 


5.33 X 


10-^ 


1.509 


JUl 73633 


26 CTL-1 


1 763 


5.15 X 


10-^ 


1.458 


JU173635 


27 SVSP-12 


1 957 


5.04 X 


10-^ 


1.424 


JU173723 


28 CTL-6 


1 607 


4.80 X 


10-^ 


1.358 


JUl 73654 


29 CRISP 


1 1579 


4.66 X 


10-^ 


1.317 


JUl 73623 


30 SVMPIII-7 


1 2343 


4.31 X 


10-^ 


1.218 


JU173719 


31 SVMPII-8 


1 1863 


4.24 X 


10-^ 


1.198 


JUl 73704 


32 PLA2-6 


1 834 


4.15 X 


10-^ 


1.174 


JUl 73681 


33 SVSP-7 


3 2108 


3.59 X 


10-^ 


1.016 


aJUl 73734, 












b:JUl 73735, 












cJU 173736 


34 CTL-1 5 


1 947 


3.59 X 


10-^ 


1.016 


JUl 73630 


35 PLA2-2 


1 872 


3.58 X 


10-^ 


1.013 


JUl 73677 


36 CTL-1 0 


1 587 


3.49 X 


10-^ 


0.987 


JUl 73624 


37 CTL-1 4 


1 1246 


3.36 X 


10-^ 


0.951 


JUl 73629 


38 CTL-1 3 


1 680 


3.15 X 


10-^ 


0.891 


JUl 73628 


39 PLA2-5 


1 651 


2.90 X 


10-^ 


0.819 


JUl 73680 


40 CTL-9 


2 755 


2.88 X 


10-^ 


0.814 


a:JU 173659, 












bJU 173660 


41 BPP 


1 1300 


2.57 X 


10-^ 


0.726 


JU173621 


42 SVMPIII-1 


2 2665 


2.41 X 


10-^ 


0.683 


aJUl 73705, 












b:JU 173706 


43 SVSP-14 


1 1732 


2.23 X 


10-^ 


0.631 


JU173725 


44 SVMPIII-4 


2 2433 


2.16 X 


10-^ 


0.611 


a:JU173715, 












b:JU173716 


45 CTL-2 


2 749 


1.90 X 


10-^ 


0.538 


aJUl 73638, 












bJU 173639 


46 SVMPIII-8 


1 2150 


1.71 X 


10-^ 


0.484 


JUl 73720 


47 VESP 


1 1603 


1.71 X 


10-^ 


0.483 


JUl 73741 


48 SVMPIII-5 


1 2339 


1.59 X 


10-^ 


0.449 


JU173717 
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Table 3 Expression levels of full-length toxin clusters (Continued) 



49 


SVSP-4 


2 


2152 


1.34 X 10-^ 


0.379 


a:JUl 73730, 














bJUl 73731 


50 


CTL-20 




785 


1.19 X 10-^ 


0.336 


JU1 73636 


51 


NUC 




2706 


1.16 X 10-^ 


0.327 


JUl 73671 


52 


SVSP-5 




1890 


1.12 X 10-^ 


0.317 


JUl 73732 


53 


SVMPIII-6 




2367 


1.10 X 10-^ 


0.311 


JU173718 


54 


CTL-21 




824 


1.09 X 10-^ 


0.307 


JUl 73637 


55 


CTL-19 




618 


1.05 X 10-^ 


0.296 


JUl 73634 


56 


NF 




1395 


9.53 X 10-2 


0.269 


JUl 73669 


57 


CTL-12 


2 


825 


9.40 X 10-2 


0.266 


aJUl 73626, 














bJUl 73627 


58 


SVSP-2 


1 


1675 


7.35 X 10-2 


0.208 


JUl 73727 


59 


CTL-5 


3 


637 


5.83 X 10-2 


0.165 


a:JUl 73651, 














b:JU 173652, 














cJU 173653 


60 


PDE 




2743 


5.62 X 10-2 


0.159 


JUl 73674 


61 


CTL-1 1 




625 


4.86 X 10-2 


0.137 


JUl 73625 


62 


PLA2-3 




957 


3.22 X 10-2 


0.091 


JUl 73678 


63 


CREGF 




1945 


3.09 X 10-2 


0.087 


JUl 73622 


64 


SVSP-10 




1815 


2.22 X 10-2 


0.063 


JUl 73721 


65 


HYAL-1 




2545 


2.10 X 10-2 


0.059 


JUl 73662 


66 


KUN 




1698 


1.14 X 10-2 


0.032 


JUl 73666 


67 


HYAL-2 




1302 


7.83 X 10-3 


0.022 


JUl 73663 


68 


KUN-1 




2575 


5.11 X 10-3 


0.014 


JUl 73664 


69 


VEGF-1 




906 


4.99 X 10-3 


0.014 


aJUl 73739, 














b:JU 173740 


70 


SVSP-1 1 




1207 


4.46 X 10-3 


0.013 


JUl 73722 


71 


GC 




1 730 


3.65 X 10-3 


0.01 0 


11 11 7^fS61 


72 


PDE-6 




3691 


3.36 X 10-3 


0.010 


JUl 73673 


73 


NGF 




951 


3.14 X 10-3 


0.009 


JUl 73670 


74 


KUN-2 




1438 


2.68 X 10-3 


0.008 


JUl 73665 


75 


PDE-4 




2633 


2.24 X 10-3 


0.006 


JUl 73672 


76 


VF 




5087 


1.70 X 10-3 


0.005 


JUl 73742 


77 


WAP 




627 


5.60 X 10-4 


0.002 


JUl 73743 


78 


CTL-1 7 




774 


3.80 X 10-4 


0.001 


JUl 73632 



for C. adamanteus probably lies somewhere between 78 
and 123. This range is at the lower end of the number 
of unique toxins typically identified for viperids by means 
of proteomic techniques [10], which may indicate that 
the venom of C adamanteus is less complex than that of 
other species. Alternatively, posttranscriptional processes 
such as alternative splicing or posttranslational modifica- 
tions could significantly increase the diversity of toxins 
present in the venom. Our identified toxins accounted for 
35.4% of the total reads (Figure 3), and the vast majority 



of the extremely high-abundance transcripts were those 
encoding toxin proteins (Figure 2A). We named toxins 
with a combination of a toxin-class abbreviation, a cluster 
number, and, if the cluster had more than a single member, 
a lower-case letter to indicate the member of the cluster 
(e.g., CTL-3b). 

We used the number or percentage of reads mapping 
to a particular transcript as a measure of its abundance. 
Although average coverage might be a more appropriate 
proxy for the number of copies of a given transcript 
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present, because it accounts for differences in transcript 
lengths, we prefer read counts as a measure of the expres- 
sion expenditure on a given transcript because they better 
reflect the energetic cost associated with producing the 
encoded protein and are consistent with previous worI< 
using low-throughput sequencing (see, e.g., Pahari et al. 
[25]). In addition, this measurement should more closely 
match proteomic-based measurements of the contents of 
venom components (see, e.g., Gibbs et al. [43]) which 
come in the form of the percentages of total peptide bonds 
in the sample. 

Snake venom metalloproteinases 

We identified 39 unique sequences and 16 clusters of 
snake-venom metalloproteinases (SVMPs) that accounted 
for 24.4% of the reads mapping to toxin sequences and 
8.6% of the total reads (Figure 3A and Table 3). In terms 
of total reads, the SVMPs were the most abundant class of 
toxins in the C. adamanteus venom-gland transcriptome. 
SVMPs are the primary sources of the local and systemic 
hemorrhage associated with envenomation by viperids 
and are divided into a number of subclasses based on 
their domain structure [44,45] . All SVMPs have a metallo- 
proteinase domain characterized by a zinc-binding motif. 
All of the SVMPs identified for C adamanteus belong to 
either the type II or the type III subclass. Type II SVMPs 
(SVMPIIs) have a disintegrin domain in addition to the 
metalloproteinase domain, which may be proteolytically 
cleaved posttranslationally to produce a free disintegrin. 
Type III SVMPs (SVMPIIIs) have a disintegrin-like and a 
cysteine-rich domain in addition to the metalloproteinase 
domain. We found 8 clusters of each of these two sub- 
classes with 23 unique SVMPII sequences and 16 unique 
SVMPIII sequences. SVMPII and SVMPIII clusters com- 
prise 16.4% and 8.0% of the reads mapping to toxins 
respectively (Figure 3). The sequences in both subclasses 
are diverse. The maximum pairwise nt divergence for the 
SVMPIIs was 10.0%, corresponding to a maximum amino- 
acid divergence of 18.1%. For the SVMPIIIs, the maxi- 
mum pairwise nt divergence was 20.4% with a maximum 
amino-acid divergence of 42.3%. Although SVMPs were 
the dominant toxins as a class, the individual SVMP clus- 
ter with the highest abundance was SVMPII-5, which was 
only the eighth most abundant toxin cluster (Figure 2B 
and Table 3). 

Mackessy [46] categorized rattlesnake venoms as type 
I or type II on the basis of their toxicities and metallo- 
proteinase activities. These two measurements tend to be 
inversely related in rattlesnakes: species (or populations) 
with low LD50 values tend also to have low or undetectable 
hemorrhagic activities. SVMPs are the major hemorrhagic 
components of snake venoms, and high toxicity appears 
to be caused mostly by neurotoxic venom components. 
Low-toxicity venoms with high metalloproteinase activity 



are classified as type I, and high-toxicity venoms with low 
metalloproteinase activity are classified as type II. On the 
basis of the abundance of SVMPs in the venom-gland 
transcriptome, C. adamanteus clearly has type I venom, 
although the relatively low toxicity of its venom [46] is 
at least partially compensated for by its large size and 
venom yield. 

C-type lectins 

The most diverse and the second most abundant toxin 
class in the C. adamanteus venom-gland transcriptome 
was the C-type lectin (CTL) class. We identified 37 unique 
sequences and 21 clusters of CTLs that accounted for 
22.2% of the reads mapping to toxins and 7.8% of the 
total reads (Figure 3 A and Table 3). CTLs generally either 
inhibit or activate components of plasma or blood-cell 
types, thereby interfering with hemostasis [47]. Most 
known snake-venom CTLs function as heterodimers or 
even more complex arrangements [48], probably account- 
ing in part for their diversity. The divergence among 
members of this class within the C. adamanteus genome 
was extreme, although all members preserved a CTL-like 
domain. Some pairs shared virtually no conserved amino- 
acid positions. Three of the CTL clusters provide evidence 
for the relevance of alternative splicing in the generation 
of toxin proteins. CTL-3f, CTL-4e, and CTL9b all have 48- 
nt insertions in the same region but are otherwise similar 
or identical to other members of their clusters. 

Snake venom serine proteinases 

The third most abundant toxin class for C. adaman- 
teus was the snake-venom serine proteinases (SVSPs). We 
identified 18 unique sequences and 14 clusters in this 
toxin class, accounting for 20.0% of the toxin reads and 
7.1% of the total reads (Figure 3 A and Table 3). Three of 
the 10 most highly expressed individual toxins were SVSPs 
(Figure 2). SVSPs interfere with a wide array of reactions 
involving blood coagulation and hemostasis and belong to 
the trypsin family of serine proteases [49,50]. Mackessy 
[46] detected significant thrombin-like and kallikrein- 
like activity in the venom of C. adamanteus, which are 
attributable to the action of SVSPs. The diversity of SVSPs 
within the C adamanteus genome is high; maximum pair- 
wise nt divergence is 20.6% and amino-acid divergence 
is 47.4%. 

The members of two SVSP clusters differ in a way 
that should be noted. The lengths of SVSPs are gener- 
ally well conserved throughout the class. SVSP-7a has a 
27-nt insertion relative to the two other members of its 
cluster but is otherwise identical to SVSP-7b. This differ- 
ence could reflect the presence of alternative splicing for 
this gene. SVSP-3a is unique among the C. adamanteus 
SVSPs or those known from other snake species in appar- 
ently having a 65-amino-acid extension of its C-terminal 
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region. The other member of its cluster, SVSP-3b, has a 
single deletion of a C nt in a poly-C tract that terminates 
its coding sequence consistently with other known SVSPs. 
The reads generating the SVSP-3a form vastly outnum- 
ber those for the SVSP-3b form; more than 95% of the 
reads support the extended version of the protein. The 
effect, if any, of this C-terminal extension remains to be 
determined. 

Phospholipase A2 's 

Previous work with C adamanteus identified only a single 
phospholipase A2 (PLA2) sequence [11], but we iden- 
tifed seven unique sequences in six clusters (Figure 2 
and Table 3), accounting for 7.8% of the toxin reads and 
2.8% of the total reads (Figure 3). PLA2s are among the 
most functionally diverse classes of snake-venom toxins 
and have pharmocological effects ranging from neuro- 
toxicity (presynaptic or postsynaptic) to myotoxicity and 
cardiotoxicity. Anticoagulant and hemolytic effects due to 
PLA2s are also known [51,52]. Compared to other toxin 
classes of C. adamanteus, the diversity of PLA2s is low. 
Five of the six clusters are all within 5% nt divergence of 
one another. PLA2-3 is the lone, high-divergence outlier, 
differing by more than 31% at the nt level from the other 
clusters. PLA2-3 is also expressed at the lowest level of any 
of the PLA2S (Table 3). 

Other high-abundance toxins 

The SVMPs, CTLs, SVSPs, and PLA2s account for 74% of 
the reads mapping to toxin sequences (Figure 3), 73% of 
the toxin clusters, and 82% of the unique toxin sequences. 
The remaining toxins belong to 16 different classes. Many 
of these are low-abundance transcripts (Figure 2 and 
Table 3) and may not actually function as significant 
toxins, whereas several others have high to moderate 
abundances and represent significant components of the 
venom. 

The most abundant toxin transcript and the most 
abundant transcript overall (Figure 2) was a small basic 
myotoxin related to crotamine [53,54]. The precursor pro- 
tein is just 70 amino acids in length with a predicted 22- 
amino-acid signal peptide. This transcript was detected 
by Rokyta et al. [11], but the coding sequence was prema- 
turely truncated in their sequence because of a single nt 
deletion. This toxin accounts for 16.8% of the toxin reads 
(Figure 3A) and 5.9% of the total reads. Crotamine, origi- 
nally isolated from the venom of C. durissus, causes spas- 
tic paralysis in mice and is found in the venoms of many 
species of Crotalus [54]. Muscle spasms, twitching, and 
paralysis of the legs have been reported for human enven- 
omations by C. adamanteus [20]. Interestingly, Straight 
et al. [55] noted that individuals of C. adamanteus from 
populations in southern and central Florida lack this toxin 
in their venoms. Given that this myotoxin is the most 



abundant transcript in the venom of our specimen, its 
absence in southern populations points to a dramatic dif- 
ference in venoms within this species and the potential for 
significantly different pathological effects associated with 
bites from different C adamanteus populations. 

A single L-amino-acid oxidase (LAAO) transcript was 
the second most abundant toxin transcript (Figure 2B), 
consistent with the previously detected LAAO activity 
in the C. adamanteus venom [46]. This single tran- 
script accounted for 5.3% of the reads mapping to toxins 
and 1.9% of the total reads. LAAOs are flavoproteins, 
giving the venom its yellow color; can be edema- or 
apoptosis-inducing; and can induce or inhibit platelet 
aggregation [56]. These effects are probably mediated by 
H2O2 released during the oxidation reaction catalyzed 
by the enzyme. The 29th most abundant toxin transcript 
was a cysteine-rich secretory protein (CRISP) (Figure 2B 
and Table 3), accounting for 1.3% of the toxin reads 
(Figure 3A). Although CRISPs are widely found in snake 
venoms, their precise effects are not well established 
[57], but they appear to interfere with smooth-muscle 
contraction [58,59]. A single transcript for a bradykinin- 
potentiating and C-type natriuretic peptide transcript 
(BPP) was found to account for 0.7% of the toxin reads 
(Figure 3A). The encoded protein is similar to a pro- 
tein identified in Sistrurus catenatus (GenBank accession: 
DQ464265) that was hypothesized to reduce blood pres- 
sure in envenomated prey [25]. A loss of blood pres- 
sure has been reported in human envenomations by 
C. adamanteus [20]. 

Other low-abundance toxins 

The remaining 17 clusters are classified as "others" in 
Figure 3A. Because each has a relatively low expression 
level (Table 3), many of these should be considered puta- 
tive toxins until their presence in the C. adamanteus 
venom is confirmed proteomically and pharmacological 
effects are associated with them. 

Rokyta et al. [11] detected the presence of a transcript 
encoding a protein homologous to ohanin from Ophioph- 
agus hannah [60,61] and to a homologous protein from 
Lachesis muta [62]; we found a transcript identical to 
that of Rokyta et al. [11]. Pung et al. [60,61] found the 
O. hannah version of this protein to increase pain sensi- 
tivity (hyperalgesia) and to induce temporary hypoloco- 
motion in mice and proposed naming the class vespryns 
(VESP). Exceptionally intense pain has been reported 
after envenomation of humans by C. adamanteus [20], 
although whether such pain is due to a specific toxin is 
not clear. 

We detected three different nucleotidases (NUCs) and 
five different phosphodiesterases (PDEs) in the venom- 
gland transcriptome of C. adamanteus. Only one of the 
NUCs and three of the PDEs had signal peptides, and 



Rokyta et al. BMC Genomics 201 2, 1 3:3 1 2 
http://www.biomedcentral.eom/1 471 -2 1 64/1 3/3 1 2 



Page 12 of 23 



we therefore only considered these as potential toxins: 
NUC, PDE, PDE-4, and PDE-6 (Table 3). The roles of 
these enzymes in venoms are uncertain, but their primary 
function may be to liberate toxic nucleosides [63-65]. 
Significant PDE activity has been detected previously in 
the venom of C. adamanteus [46]. 

The C. adamanteus venom-gland transcriptome con- 
tained three Kunitz-type protease inhibitors (KUNs). 
Two of these shared more than 75% animo-acid identity 
with a KUN from Austrelaps labialis (GenBank acces- 
sion: B2BS84), an Australian elapid. All three KUNs have 
domains that place them in the superfamily of bovine pan- 
creatic trypsin-like inhibitors, and snake toxins from this 
family are known to inhibit plasma serine proteinases. 
Although KUNs are commonly observed in snake venoms, 
their role in envenomation (if any) is not well defined [66] . 
The three KUNs detected for C. adamanteus are all at rel- 
atively low abundances, suggesting that they are not major 
components of the venom. 

We identified two transcripts, HYAL-1 and HYAL-2, 
encoding hyaluronidase-like proteins. Hyaluronidases are 
generally regarded as venom components that promote 
the dissemination of other venom components by degrad- 
ing the extracellular matrix at the site of injection [67], 
although they may have more direct toxic eff^ects [68]. 
The coding sequences of our two transcripts diff'er only 
in the presence of a 765-nt deletion in HYAL-2 relative to 
HYAL-1. Truncated hyaluronidases such as HYAL-2 have 
been detected in the venoms of other viperid species [67] 
and may represent an example of alternative splicing. We 
also identified a transcript encoding a glutaminyl-peptide 
cyclotransferase (glutaminyl cyclase; GC). Many snake 
venom components have N termini blocked by pyrogluta- 
mate, and GCs catalyze the formation of this block. This 
component is related more to maturation and protection 
of other toxins and probably contributes only indirectly to 
toxicity [69]. 

We identified six growth-factor-related sequences in the 
venom-gland transcriptome of C. adamanteus: a nerve 
growth factor (NGF), a neurotrophic factor (NF), two 
vascular endothelial growth factors (VEGF) in a sin- 
gle cluster, and a cysteine-rich with EGF-like domain 
protein (CREGF). The NGF transcript encodes a 241 
amino-acid precursor protein and shares 99% amino- 
acid identity with a NGF from C. durissus (GenBank 
accession: AAG30924). The NF transcript encodes a 180- 
amino-acid precursor that shares homology with mes- 
encephalic astrocyte-derived neurotrophic factors. We 
found no close venom-related sequences for this NF in 
the available databases. The VEGF sequences appear to 
be alternatively spliced versions of one another. VEGF- 
la encodes a 192-amino-acid precursor, and VEGF-lb 
encodes a 148-amino-acid precursor. Aside from the 
132-nt deletion in VEGF-lb relative to VEGF-la, their 



coding sequences are identical. Both forms have database 
matches of the same length with 99% amino-acid iden- 
tity from Trimeresurus flavoviridis (GenBank accessions: 
AB154418 and AB154419). Finally, we detected the same 
cysteine-rich with EGF-like domain protein as described 
by Rokyta et al [11]. 

The final two putative toxin transcripts are of question- 
able significance because of their low expression levels. 
A single sequence with 77% amino-acid identity to a 
waprin (WAP) sequence from Philodryas olfersii (Gen- 
Bank accession: EU029742), a rear-fanged colubrid, was 
detected. Related sequences have been detected in a vari- 
ety of other rear-fanged snake species, but such proteins 
are only known to exhibit antimicrobial activity [70] . We 
detected a venom factor (VF) transcript that shares 87% 
animo-acid identity with a VF from Austrelaps superbus 
(GenBank accession: AY903291) [71]. The C. adamanteus 
VF transcript encodes a 1,652-amino-acid precursor with 
a 22-amino-acid signal peptide. The best-studied member 
of this toxin family is cobra venom factor, which is known 
to activate the complement system [72]. The extremely 
low expression levels of these transcripts may indicate that 
they represent the orthologous genes to the ancestors of 
the known toxic forms and may therefore have no toxic 
functions. 

Comparison to previous work 

Rokyta et al. [11] previously described toxin transcripts in 
the venom-gland transcriptome of C. adamanteus on the 
basis of 454 pyrosequencing. Their work used RNA from 
the venom gland of the same individual used in the present 
work. They found 40 unique toxin transcripts, 10 of which 
contained only partial coding sequences. Table 4 lists the 
closest matches from our current sequences to those of 
Rokyta et al. [11]. The vast majority of the 454-based 
sequences had either identical matches in our current 
set of toxins or matches with less than 1% nt divergence 
(Table 4). Only a single 454 toxin, SVSP-9, did not have 
a close match. This sequence contains only a partial cod- 
ing sequence and therefore may not represent a true, 
functional toxin. 

Nontoxin transcripts 

We characterized the nontoxin genes expressed in the 
C. adamanteus venom gland by two means. First, we 
took all of the contigs from one of our four de novo 
NGen assemblies based on 20 million merged reads and 
conducted a full Blast2Go [73] analysis on the contigs 
comprising >100 reads. Of the 12,746 contigs (assem- 
bly 2 in Table 2), we were able to provide gene ontology 
(GO) annotations for 9,040 of them (Figure 4A). The 
major functional classes (level 2) represented in these 
results were binding and catalysis, followed by transcrip- 
tion regulation (Figure 4B). The major biological process 
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Table 4 Correspondence with the results of Rokyta et al. [1 1 ] 



454 name 


Accession 


Closest match 


% nt divergence 


Notes 


CREGF 


HQ414087 


CREGF 


0.1 




CRISP 


HQ414088 


CRISP 


0.0 


Identical 


CTL-1 


HQ414089 


CTL-4a 


0.0 


Identical 


CTL-2 


HQ414090 


CTL-8a 


0.0 


Identical 


CTL-3 


HQ414091 


CTL-1 


0.0 


Identical 


CTL-4 


HQ414092 


CTL-9a 


0.8 




CTL-5 


HQ414093 


CTL-3e 


0.9 




CTL-6 


HQ414094 


CTL-1 2b 


0.0 


Identical 


CTL-7 


HQ414095 


CTL-1 0 


0.0 


Identical 


CTL-8 


HQ414096 


CTL-2a 


0.0 


Identical 


CTL-9 


HQ414097 


CTL-5a 


0.0 


Identical 


HYAL 


HQ414098 


HYAL-1 


0.0 


454 version incomplete 


LAAO 


HQ414099 


LAAO 


0.0 


Identical 


MYO 


HQ414100 


MYO 


0.0 


454 version has 1 -nt deletion that truncates the coding sequence prennaturely 


NUC 


HQ414101 


NUC 


0.0 


454 version incomplete 


PDE-1 


HQ414102 


PDE 


0.0 


454 version has 1 23-nt insertion 


PDE-2 


HQ414103 


PDE-2 (nontoxin) 


0.0 


454 version incomplete; no signal peptide; no longer considered toxin 


PLA2 


HQ414104 


PLA2-1b 


0.0 


Identical 


PLB 


HQ414105 


PLB (nontoxin) 


0.2 


No longer considered toxin 


SVMP-1 


HQ414106 


SVMPII-3b 


0.0 


Identical 


SVMP-2 


HQ414107 


SVMPII-3b/c 


0.5 




SVMP-3 


HQ414108 


SVMPII-5a 


0.3 




SVMP-4 


HQ414109 


SVMPIII-2d 


1.2 




SVMP-5 


HQ414110 


SVMPIII-4b 


1.0 




SVMP-6 


HQ414111 


SVMPIII-2d 


0.2 




SVMP-7 


HQ414112 


SVMPIII-4a 


0.0 


Identical 


SVMP-8 


HQ414113 


SVMPIII-5 


0.5 


454 version incomplete 


SVMP-9 


HQ414114 


SVMPIII-la/b 


0.0 


454 version incomplete 


SVMP-1 0 


HQ414115 


SVMPIII-6 


0.0 


454 version incomplete 


SVMP-1 1 


HQ414116 


SVMPIII-3a 


0.0 


454 version incomplete 


SVSP-1 


HQ414117 


SVSP-3a 


0.0 


454 version has 1 -nt deletion that truncates the coding sequence prematurely 


SVSP-2 


HQ414118 


SVSP-1 


0.0 


Identical 


SVSP-3 


HQ414119 


SVSP-7a 


0.0 


Identical 


SVSP-4 


HQ414120 


SVSP-5 


0.1 




SVSP-5 


HQ414121 


SVSP-9 


0.5 




SVSP-6 


HQ414122 


SVSP-6 


0.0 


Identical 


SVSP-7 


HQ414123 


SVSP-4b 


0.0 


454 version incomplete 


SVSP-8 


HQ414124 


SVSP-2 


0.0 


454 version incomplete 


SVSP-9 


HQ414125 


None 


> 10 


454 version incomplete 


VESP 


HQ414126 


VESP 


0.0 


Identical 
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Figure 4 Comparison of gene ontology (GO) results for our annotated full-length nontoxin sequences with those of the contigs from a de 
novo assembly with NGen. Only level 2 GO terms are shown. The distributions of GO terms are similar across data sets, suggesting that the 
annotated transcripts provided a comprehensive characterization of the genes expressed in the venom gland. (A) The distributions of sequences 
reaching various stages of identification and annotation are shown. The level 2 GO terms are shown for molecular function (B), biological process 
(C), and cellular component (D). 



GO terms (level 2) were cellular processes and metabolic 
processes (Figure 4C). Interestingly, viral reproductive 
function was detected and probably represents the activ- 
ity of transposable elements or retroviruses like those 
previously noted in snake venom-gland transcriptomes 
[34]. The major cellular component GO terms (level 2) 
were cell and organelle (Figure 4D). For these results, we 
made no attempt to exclude toxin sequences, because they 
are necessarily a small minority of the total sequences, 
and did not require that contigs contain full-length 
coding sequences. 

For our second approach, we used only the 2,879 tran- 
scripts with full-length coding sequences for nontoxin 
proteins. We analyzed these sequences with Blast2GO. 
The distributions of level 2 GO terms for these data 
were almost identical to those of the full NGen assem- 
bly described above (Figure 4), suggesting that our 2,879 
annotated nontoxin sequences provide a representative 
sample of the full venom-gland transcriptome. The full 



distributions of GO terms for these sequences across all 
levels are shown in Figures 5, 6, and 7. As expected for 
a secretory tissue, processes related to protein produc- 
tion and secretion were well represented (e.g., protein 
transport and protein modification; Figure 5), as were 
protein-binding functions (Figure 6) and proteins local- 
ized to the endoplasmic reticulum (ER) and the Golgi 
apparatus (Figure 7). 

Four of the top 20 most highly expressed nontoxin 
genes (Table 5), including the most highly expressed, 
were protein disulfide isomerases (PDIs). In particular, 
they were members of the PDI family that is retained 
in the ER and are characterized by having two or more 
PDI domains, which are similar to thioredoxin. PDIs cat- 
alyze the formation or breaking of disulfide bonds and are 
therefore involved in protein folding. Molecular chaper- 
ones were well represented in the top 20 nontoxins by 
four genes: endoplasmin (a member of the HSP90 family), 
calreticulin, 78-kDA glucose-regulated protein (GRP78), 
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Figure 5 The biological-process GO terms identified for the 2,879 annotated full-length nontoxin sequences. Terms specific for the 
production, processing, and export of proteins are liigliliglited in blacl<. Tine inset sliows tlie low-abundance portion of tlie full distribution. 



and heat shock protein 5. The latter gene appears to be 
a splice variant of GRP78, differing within the coding 
region by two point mutations and two short deletions. 
All of these chaperones are ER specific. Six of the top 20 
nontoxins were mitochondrial genes involved in oxida- 
tive cellular respiration, consistent with the high ener- 
getic demands of venom production [74]: cytochrome C 
oxidase subunits I and III, cytochrome B, and NADH 
dehydrogenase subunits 1, 4, and 5. The cells of venom 
glands are particularly rich in mitochondria [75]. Four 
genes were involved in various aspects of translation: 
two translation elongation factors, 18S rRNA, and vig- 
ilin. Vigilins are hypothesized to be involved in reg- 
ulating mRNA stability and translation and might be 
involved in RNA-mediated gene silencing [76,77]. The 
final top 20 nontoxin gene was actin, a component of the 
cytoskeleton. 

The abundances of several major classes of nontoxins 
are provided in Figure 3B. We identified 57 sequences with 
functions related to protein folding [19,78-80], including 
various classes of heat-shock proteins, protein-disulfide 
isomerases, peptidyl-prolyl cis-trans isomerases, dnaj- 
complex components, and T-complex components. These 
sequences together accounted for 28.4% of the total 
reads mapping to nontoxins. Ribosomal-protein tran- 
scripts (cytoplasmic and mitochondrial) accounted for 



9.5% of the nontoxin reads, and mitochondrial genes 
accounted for another 9.0%. Finally, we identified 110 
sequences transcripts encoding proteins involved in pro- 
tein degradation [81,82], including proteins involved in 
the ubiquitin-proteasome system and the ER-associated 
protein-degradation system [83], which accounted for 
2.6% of the nontoxin reads. Protein-quality control should 
be essential in a high-throughput protein-producing tis- 
sue such as a snake venom gland. 

Our collection of nontoxins included several notable 
potential inhibitors of the toxins or other proteases 
(Table 6). Such inhibitors may play a role in preventing 
autolysis [84] or may serve to protect venom components 
once inside a victim [85]. We detected three cystatin-like 
transcripts in the venom gland. Cystatins are cysteine- 
protease inhibitors and have been detected in numerous 
elapid venom glands and venoms [85]. We detected three 
unique metalloproteinase inhibitors and two serine pro- 
teinase inhibitors (serpins). Finally, we found four unique 
PLA2 inhibitors. 

Sequence accession numbers 

The original, unmerged sequencing reads were submit- 
ted to the National Center for Biotechnology Information 
(NCBI) Sequence Read Archive under accession number 
SRA050594. The annotated toxin and nontoxin sequences 
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Figure 6 The molecular-function GO terms identified for the 2,879 annotated full-length nontoxin sequences. Terms specific for the 
production, processing, and export of proteins are liigliliglited in blacl<. Tine inset sliows tine low-abundance portion of tine full distribution. 



were submitted to the GenBank Transcriptome Shot- 
gun Assembly (TSA) database under accession numbers 
JU173621-JU173743 (toxins) and JU173744-JU176622 
(nontoxins). 

Conclusions 

We have described the most comprehensive venom-gland 
transcriptomic characterization of a snake species to 
date and provided full-length coding sequences for 123 
unique toxin proteins and 2,879 unique nontoxin proteins. 
We have demonstrated the use of Illumina sequencing 
technology for the sequencing and de novo assembly of 
a tissue-specific transcriptome for a nonmodel species, 
C. adamanteus, for which genome-scale resources were 
previously unavailable. Because the nontoxin sequences in 
particular should be conserved across snake species, our 
results should greatly facilitate similar work with other 
venomous species, serving as an assembly template and 
reducing the number of reads for which de novo assembly 
will be necessary. 

The expressed toxin genes in the venom gland of 
C. adamanteus provide a detailed portrait of a type I 
rattlesnake venom [46]. The most abundant transcript 
expressed in the C. adamanteus venom gland encoded 



a myotoxin homologous to crotamine. Crotamine is 
known to induce spastic paralysis [54], a symptom 
that has been observed in human envenomations by 
C. adamanteus [20]. Like those of most viperids, the 
bites of C. adamanteus result in significant tissue dam- 
age and necrosis, and we found that SVMPs, the major 
class of hemorrhagic toxins, dominated venom-gland 
gene expression. The second most abundant toxin tran- 
script overall was an LAAO, which are also noted for 
causing local tissue damage [46]. Coagulopathy is a 
common occurrence with pit- viper bites [5]. The CTLs 
and SVSPs were also both diverse and abundant in the 
venom-gland transcriptome of C. adamanteus, and both 
classes primarily attack the hemostatic system. In terms 
of gene sequences of venom components, the venom of 
C. adamanteus is now the best-characterized snake 
venom, although a thorough proteomic analysis of the 
venom is still needed. The sequences we have generated 
will greatly facilitate such a proteomic characteriza- 
tion by serving as a database against which to query 
mass-spectrum results. 

The expression patterns of the nontoxin genes in the 
venom gland of C. adamanteus reflect the protein- 
secretory function of the tissue and the high energetic 
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demands of rapid venom production [75]. The most highly 
expressed nontoxin genes were those involved in the 
production and processing of proteins and energy pro- 
duction to support these activities. Molecular chaperones 
and PDIs were particularly abundant. Though the expres- 
sion patterns for nontoxins were not surprising, future 
comparisons with other snake species, especially those 
from other snake families, may be able to elucidate the ori- 
gin and early stages of the evolution of the venom gland. 

Methods 

Venom-gland transcriptome sequencing 

We sequenced the venom-gland transcriptome of a sin- 
gle animal from Florida (Wakulla County): an adult female 
weighing 393 g with a snout-to-vent length of 792 mm and 
a total length of 844 mm. To stimulate transcription in 
the venom glands, we anesthetized the snake by propofol 
injection (10 mg/kg) and extracted venom by electros- 
timulation under anesthesia [86]. After venom extraction, 
the animal was allowed to recover for four days while 
transcription levels reached their maxima [87]. The snake 
was euthanized by injection of sodium pentobarbitol (100 
mg/kg), and its venom glands were subsequently removed. 
The above techniques were approved by the Florida State 



University Institutional Animal Care and Use Committee 
(lACUC) under protocol #0924. 

Sequencing and nonnormalized cDNA library prepa- 
ration were performed by the HudsonAlpha Institute 
for Biotechnology Genomic Services Laboratory (http:// 
www.hudsonalpha.org/gsl/). Transcriptome sequencing 
was performed essentially as described by Mortazavi 
et al. [88] in a modification of the standard Illumina 
methods described in detail in Bentley et al. [89]. 
Total RNA was reduced to poly-A+ RNA with oligo- 
dT beads. Two rounds of poly-A+ selection were per- 
formed. The purified mRNA was then subjected to 
a mild heat fragmentation followed by random prim- 
ing for first-strand synthesis. Standard second-strand 
synthesis was followed by standard library preparation 
with the double-stranded cDNA as input material. This 
approach is similar to that of lUuminas TruSeq RNA- 
seq library preparation kit. Sequencing was performed in 
one lane on the Illumina HiSeq 2000 with 100-base-pair 
paired-end reads. 

Transcriptome assembly and analysis 

The average insert length of our cDNA library was ~170 
nt, excluding the Illumina adaptors. With 100-base-pair 
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Table 5 The 20 most highly expressed nontoxin transcripts 



Name 


Length 


% reads 


Function 


Accession 


Protein disulfide isomerase 


2970 


5.223 


Rearrange disulfide bonds (ER) 


JU1 75360 


Cytochrome C oxidase subunit 1 


2789 


0.966 


Electron transport chain 


JU 175042 


Cytochrome B 


1275 


0.499 


Electron transport chain 


JU 175040 


Translation elongation factor 1 a 1 


1985 


0.459 


Translation 


JU 174424 


18S rRNA 


2509 


0.421 


Ribosomal component 


JU1 73759 


Calreticulin 


1661 


0.406 


Protein chaperone (ER) 


JU 174061 


Endoplasmin (HSP90 family) 


3012 


0.333 


Protein chaperone (ER) 


JU 174456 


78 kDa glucose-regulated protein 


2676 


0.332 


Protein chaperone (ER) 


JU174713 


Heat shock protein 5 (GRP78 splice variant?) 


2063 


0.327 


Protein chaperone (ER) 


JU 174801 


NADH dehydrogenase subunit 5 


4448 


0.272 


Electron transport chain 


JU175113 


Cytochrome C oxidase subunit III 


2103 


0.239 


Electron transport chain 


JU 175043 


Protein disulfide isomerase A6 


4933 


0.212 


Rearrange disulfide bonds (ER) 


JU175358 


Nucleobindin 2 


2937 


0.203 


Calcium binding 


JUl 75278 


Protein disulfide isomerase A3 


1904 


0.186 


Rearrange disulfide bonds (ER) 


JU1 75355 


NADH dehydrogenase subunit 1 


1878 


0.173 


Electron transport chain 


JU175111 


NADH dehydrogenase subunit 4 


1751 


0.172 


Electron transport chain 


JU175112 


Protein disulfide isomerase A4 


2540 


0.159 


Rearrange disulfide bonds (ER) 


JUl 75356 


Translation elongation factor 2 


3057 


0.147 


Translation 


JUl 74429 


Vigilin 2 


6107 


0.129 


mRNA stability and translation 


JU176512 


Actin, cytoplasmic 2 


1971 


0.124 


Cytoskeleton 


JUl 73777 



paired-end sequencing, the majority of paired-end reads 
overlapped at their 3' ends. Because read quality declines 
toward the 3' ends of reads, we developed a method 
similar to that of Rodrigue et al. [36] for merging the 
overlapping pairs into single, long, high-quality reads. 
The members of each pair of reads were slid along 
each other, and, for each overlap of length we cal- 
culated the probability of getting the observed number 



of matches k by chance using a binomial probability 
given by 

assuming any of the four nucleotides is equally likely to be 
at any position. To be conservative, we only merged reads 



Table 6 Toxin and protease inhibitors detected in the venom-gland transcripts 


Name 


Length 


% reads 


Function 


Accession 


Cystatin 1 


790 


9.80 X 10-4 


Cysteine-protease inhibitor 


JUl 74278 


Cystatin B 


460 


1.23 X 10-3 


Cysteine-protease inhibitor 


JUl 74279 


Cystatin 2 


709 


1.31 X 10-3 


Cysteine-protease inhibitor 


JUl 74280 


Metalloproteinase inhibitor 1 


820 


1 .04 X 1 0-3 


Metalloproteinase inhibitor 


JU175124 


Metalloproteinase inhibitor 2 


2560 


2.57 X 10-3 


Metalloproteinase inhibitor 


JU175125 


Metalloproteinase inhibitor 3 


2202 


1.01 X 10-3 


Metalloproteinase inhibitor 


JU175126 


PLA2 inhibitor beta 


1210 


1.54 X 10-3 


PLA2 inhibitor 


JUl 75425 


PLA2 inhibitor gamma B 1 


1492 


3.68 X 10-3 


PLA2 inhibitor 


JUl 75444 


PLA2 inhibitor gamma B 2 


1694 


7.70 X 10-4 


PLA2 inhibitor 


JUl 75442 


PLA2 inhibitor B 


2339 


1.07 X 10-3 


PLA2 inhibitor 


JUl 75443 


Serpin B6 


1708 


9.68 X 10-3 


Serine-proteinase inhibitor 


JUl 75869 


Serpin HI 


2004 


9.40 X 10-4 


Serine-proteinase inhibitor 


JUl 75870 
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if the minimum probability was less than 10~^^ and the 
second smallest probability was at least 1000 times larger 
(Figure lA). The latter condition was meant to help avoid 
merging reads that span highly repetitive regions. For 
cases in which the insert size was less than the read length, 
sequence data outside the overlap were assumed to repre- 
sent adaptors and were deleted. We updated quality scores 
for the overlapping positions following the approach of 
Rodrigue et al. [36]. For merged reads, quality scores for 
nonoverlapping bases were left unchanged (Figure IB). 
The unmerged reads were typically those pairs from the 
longer end of the insert- size distribution. 

Because of the inherent difficulty in de novo tran- 
scriptome assembly, we used a diverse array of assembly 
approaches and combined the results for a final data 
set. We performed assemblies using ABySS version 1.2.6 
[37,38] under a wide array of parameter values using both 
the merged and unmerged reads. In particular, we used 
/c-mer values of 51, 61, 71, 81, and 91 and varied the 
coverage (c) and erode [e) parameters from 2 to 1,000. 
We set £ = 0, m = 20, and s = 200 for all assem- 
blies. Trans-ABySS [90] provided little or no improvement 
of our assemblies, primarily because assembly quality 
appeared to be more dependent on the coverage and 
erode parameters than on the /c-mer length. We also con- 
ducted assemblies using both the merged and unmerged 
reads with Velvet version 1.1.02 [39] and /c-mer values 
of 71, 81, and 91. We selected the best of these assem- 
blies on the basis of the N50 values for further assembly 
into transcripts with Oases version 0.1.20 (http://www.ebi. 
ac.uk/~zerbino/oases/) [40]. For Oases, we set the mini- 
mum transcript length to 300 nt and the coverage cutoff 
to 10. We also followed the approach of Rokyta et al. 
[11] and used the NGen2.2 assembler from DNAStar 
(http://www.dnastar.com/). Because this assembler is lim- 
ited to 20-30 million reads, we used only the merged 
reads. We performed four independent assemblies: three 
with 20 million merged reads each and one with the 
remaining 12,114,709 merged reads. Each assembly was 
performed with the default settings for high-stringency, 
de novo transcriptome assembly for long lUumina reads, 
including default quality trimming. The high-stringency 
setting corresponded to setting the minimum match per- 
centage to 90%. We retained contigs comprising at least 
100 reads. 

In addition to the all-at-once assembly approaches 
above, we developed an iterative approach that was both 
more effective at generating full-length transcripts and 
more computationally efficient. The first step consisted of 
applying our Extender program (see below) as a de novo 
assembler starting from 1,000 reads. Full-length tran- 
scripts were identified with blastx searches (see below), 
then used as templates in a reference-based assembly in 
NGen3.1 with a 98% minimum match percentage to filter 



reads corresponding to identified transcripts. Ten million 
of the unassembled sequences were then used in a de 
novo transcriptome assembly in NGen3.1 with the same 
settings as described above for de novo assembly except 
that the minimum match percentage was increased to 93% 
and contigs comprising less than 200 sequences were dis- 
carded. The resulting sequences were identified, where 
possible, by means of blastx searches, and the identified 
full-length transcripts were used in another templated 
assembly to generate a further-reduced set of reads. This 
iterative process was repeated two additional times. 

To provide transcriptional profiles of the venom gland, 
we performed GO annotation with Blast2GO [73]. We 
ran full analyses on one of NGen assemblies of 20 mil- 
lion merged reads, including blastx searches, GO map- 
ping, and annotation. We used the default Blast2GO 
parameters throughout. We converted the GO anno- 
tation to generic GO-slim terms. We ran the same 
analysis on the combined set of annotated nontoxin 
sequences. 

For gene identification and annotation, we conducted 
blastx searches using mpiblast version 1.6.0 (http://www. 
mpiblast.org/) of the consensus sequences of contigs of 
our assemblies against the NCBI nonredundant pro- 
tein database (nr; downloaded March 2011 and updated 
through November 2011). We used an E- value cut-off of 
10""^, and only the top 10 matches were considered. For 
toxin identification, hit descriptions were searched for 
a set of keywords based on known snake-venom toxins 
and protein classes. Any sequence matching these key- 
words was checked for a full-length coding sequence. We 
generally only retained transcripts with full-length cod- 
ing sequences (but see below). For the iterative assembly 
approach, the remaining, presumably nontoxin-encoding, 
contigs were screened for those whose match lengths 
were at least 90% of the length of at least one of their 
database matches. This step was intended to minimize 
the number of fragmented or partial sequences that were 
considered for annotation. In addition, we sorted the 
contigs of the three 20-million-sequence NGen assem- 
blies from the all-at-once approach on the basis of the 
number of reads and attempted to annotate the top 500 
contigs from one assembly and the top 100 from the 
other two. 

We estimated transcript abundances using high- 
stringency reference-based assemblies in NGen3.1 with 
a minimum match percentage of 95. Ten million of the 
merged reads were mapped onto the full-length, anno- 
tated transcripts, and the percentage of reads mapping to 
each transcript was used as a proxy for abundance. 

The extender 

The purpose of Extender is to estimate quickly one 
or more full-length transcript sequences from a large 
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number of high-quality sequence reads. The procedure 
begins with one or more seed sequences provided by the 
user. The seeds can be known sequences (e.g., partial 
transcripts from a previous assembly) or simply sequences 
of one or more of the reads. The Extender procedure 
begins by hashing the k-mers observed at the two ends 
of the seeds. If k is set to 50, for example, then the 50- 
base sequence present at the 5' end of each seed is used 
as a key in a hash table, and the hash value is a pointer 
to the seed in the list of seeds. A second hash table is 
likewise used for /c-mers from the 3' ends of the seeds. 
Note that this method requires that all initial /c-mers be 
unique (that no two sequence ends be identical). Once 
the seeds are hashed, the seeds are extended with the 
set of reads provided by the user as follows. The two k- 
mers from the ends of each read are looked up in each 
hash table. If the key is present in the hash table, the 
seed is extended by concatenation of the nonoverlap- 
ping bases from the read onto the appropriate end of the 
seed. If the key is absent, the reverse complement of the 
read is used to extend the seed if the end /c-mers are 
found. After each extension, the /c-mer key facilitating the 
extension is removed from the hash table and the new 
A"-mer key is added (the reference to the seed remains 
the same). The procedure is repeated until the reads have 
been cycled through N times, where N is chosen by the 
user. Cycling is beneficial because the Extender does not 
reset to the beginning of the read list when an extension 
is made. 

Extension of a seed typically terminates when the end 
of the full-length transcript is reached or when a sequenc- 
ing error is encountered in the end of an incorporated 
read. The presence of low-frequency biological artifacts 
(e.g., unspliced introns) may also result in termination 
of the extension. In order to improve the accuracy of 
the consensus sequence prediction. Extender can cre- 
ate replicate seeds for a particular seed by sequentially 
trimming one base at a time from both ends. Using 
replicate seeds allows several independent sequences that 
represent the same target consensus sequence to be gen- 
erated simultaneously, and these replicates are entirely 
independent because they begin with different keys. The 
user can obtain the final estimate of the sequence cor- 
responding to each original seed by taking the consen- 
sus across replicates or by simply choosing the replicate 
producing the longest sequence. We took the former 
approach for all of our assembly efforts. Overall, Extender 
is highly inefficient with its use of data and requires many 
long, high-quality reads, but it is extremely computation- 
ally efficient, having short run times and low memory 
requirements. 

We used Extender in two different ways: to complete 
partial toxin transcripts and as a de novo assembler. For 
the former, we used partial toxin transcripts from NGen 



assemblies that were found to have fragments of cod- 
ing sequence homologous with known toxins. The par- 
tial transcripts were trimmed to just the partial coding 
sequence and used as seeds. To use Extender as a de novo 
assembler, we seeded it with 1,000 random reads. For 
both applications, we used a k-mer size of 100, 20 repli- 
cates, 10 cycles through the complete set of merged reads 
excluding all reads with any bases with quality scores less 
than 30. 
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